Red Wine Quality EDA - Angelica Zhang

# load the ggplot package 
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
## 
##     percent
## The following object is masked from 'package:ggplot2':
## 
##     syms
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
## 
##     as.array
library(ggcorrplot)
getwd()
## [1] "/Users/tianlingmengyu/Desktop/EDA_red_wine_quality"
setwd('/Users/tianlingmengyu/Documents/R')

wine <- read.csv("wineQualityReds.csv")

Summary of the Data Set

names(wine)
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
str(wine)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
summary(wine)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
length(wine$X)
## [1] 1599
#remove the X column
wine <- subset(wine, select = -c(X))

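What is the structure of your dataset?

After removing the X index column, the dataset has 1599 observations of 12 variables: 11 numeric chemical measurements and the integer quality score.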

# Univariate Plots Section

univariate_plot <- function(varname) {
  # varname is a numeric vector such as wine$fixed.acidity;
  # the caller adds the axis label with xlab()
  ggplot(aes(x = varname), data = wine) +
    geom_histogram()
}

##fixed.acidity
univariate_plot(wine$fixed.acidity) + xlab('fixed.acidity')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(wine$fixed.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

fixed acidity: most acids involved with wine are fixed or nonvolatile (they do not evaporate readily).

Wines in this dataset have fixed.acidity between 4.60 and 15.90, with a median of 7.90 and a mean of 8.32.
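
The `stat_bin()` message above asks for an explicit binwidth. One common starting point, which is an assumption here and not something the original analysis used, is the Freedman-Diaconis rule, binwidth = 2 * IQR / n^(1/3). A minimal sketch using the quartiles and row count reported above for fixed.acidity:

```r
# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3)
# Inputs are copied from the summary() output above; the rule itself is
# only a suggested starting point, not part of the original analysis.
q1 <- 7.10   # 1st quartile of fixed.acidity
q3 <- 9.20   # 3rd quartile of fixed.acidity
n  <- 1599   # number of wines

fd_binwidth <- 2 * (q3 - q1) / n^(1/3)
print(round(fd_binwidth, 2))
```

With the data loaded, the same value could be computed as `2 * IQR(wine$fixed.acidity) / nrow(wine)^(1/3)` and passed to `geom_histogram(binwidth = ...)`.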

##volatile.acidity
univariate_plot(wine$volatile.acidity) + xlab('volatile.acidity')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(wine$volatile.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste.

Wines in the dataset have volatile.acidity between 0.12 and 1.58. Because too much acetic acid leads to an unpleasant vinegar taste, most wines have a volatile.acidity below 0.8.

##citric.acid
univariate_plot(wine$citric.acid) + xlab('citric.acid')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Adjust the binwidth in order to find more details

ggplot(aes(x = citric.acid), data = wine) +
  geom_histogram(binwidth = 0.01)

citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines.

Vinho Verde is known for its fresh taste, so we can find citric acid in most of the wines.

summary(wine$citric.acid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

With the binwidth reduced to 0.01, the largest peak is at 0.

##residual.sugar histogram
univariate_plot(wine$residual.sugar) + xlab('residual.sugar')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(wine$residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

residual.sugar is one of the most important features for categorizing red wine. Most wines are below 4 g/dm^3, which means most of the wines in this dataset are dry.

##chlorides histogram
univariate_plot(wine$chlorides) + xlab('chlorides')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Adjust the binwidth and x scale in order to see the majority of the chlorides factor

##chlorides histogram adjusted
ggplot(aes(x = chlorides), data = wine) +
  geom_histogram(binwidth = 0.005) +
  scale_x_continuous(limits = c(0,quantile(wine$chlorides,0.95)))
## Warning: Removed 80 rows containing non-finite values (stat_bin).

summary(wine$chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The minimum is 0.012, the median is 0.079, the mean is 0.087, and the maximum is 0.611; the majority of wines fall between 0.025 and 0.125.
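
Limiting the x axis at the 95th percentile drops roughly the top 5% of rows, which is why the warning above reports 80 of the 1599 wines removed. A small sketch on made-up, right-skewed data (the vector `x` is synthetic, not the chlorides column) shows the effect:

```r
# Synthetic right-skewed data standing in for a column like chlorides.
set.seed(1)
x <- rexp(1000, rate = 10)

# Upper limit at the 95th percentile, as in the scale_x_continuous() call.
cutoff <- quantile(x, 0.95)

# Values above the cutoff are the ones a limited x scale would discard.
kept    <- x[x <= cutoff]
dropped <- length(x) - length(kept)
print(dropped / length(x))
```

If the bins should still be computed from all rows, `coord_cartesian(xlim = ...)` zooms the view without discarding data.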

##free.sulfur.dioxide histogram
univariate_plot(wine$free.sulfur.dioxide) + xlab('free.sulfur.dioxide')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(wine$free.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

##total.sulfur.dioxide histogram
univariate_plot(wine$total.sulfur.dioxide) + xlab('total.sulfur.dioxide')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(wine$total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

So, split the wines at a total.sulfur.dioxide of 50 and adjust the x scale.

wine$sulfur.bucket[wine$total.sulfur.dioxide >= 50] <- 'sulfur_above_50'
wine$sulfur.bucket[wine$total.sulfur.dioxide < 50] <- 'sulfur_below_50'
##total.sulfur.dioxide histogram
ggplot(aes(x = total.sulfur.dioxide,color = sulfur.bucket), data = wine) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~sulfur.bucket) +
  scale_x_continuous(limits = c(0,200))
## Warning: Removed 2 rows containing non-finite values (stat_bin).

##pH histogram
univariate_plot(wine$pH) + xlab('pH')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(wine$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Normally, the pH of red wine is between 3.3 and 3.8. The red wine in this dataset comes from Portugal, where the sunlight is not very strong, so the wine has a lower pH than wine from Australia; the pH chart shows the same pattern.

##sulphates histogram
univariate_plot(wine$sulphates) + xlab('sulphates')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(wine$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
##alcohol histogram
univariate_plot(wine$alcohol) + xlab('alcohol')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

According to Wikipedia, the alcohol content of Vinho Verde is between 8.4 and 14, and the histogram shows a similar range.

summary(wine$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
##density histogram
univariate_plot(wine$density) + xlab('density')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The density of water is about 1.0 g/cm^3 and the density of alcohol is 0.789 g/cm^3. Given the alcohol distribution above, we would expect the density of these wines to sit just below that of water, and the density chart shows this.
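
As a rough check on this reasoning, a naive estimate can treat wine as a simple mix of water and ethanol by volume. This is an idealized assumption: it ignores sugars, acids, and the volume change on mixing, and it assumes the alcohol column is percent by volume. The sketch uses the median alcohol reported in this report:

```r
# Naive two-component estimate of wine density (g/cm^3).
# Assumes ideal mixing by volume and nothing else dissolved in the wine.
water_density   <- 1.0
ethanol_density <- 0.789
alcohol_pct     <- 10.20   # median alcohol in this dataset

alcohol_frac  <- alcohol_pct / 100
naive_density <- (1 - alcohol_frac) * water_density +
  alcohol_frac * ethanol_density
print(naive_density)
```

The estimate comes out below the median density of 0.9968 reported in this report, which would be consistent with dissolved sugars and acids raising the density.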

summary(wine$density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037
##quality histogram
ggplot(aes(x = quality), data = wine) +
  geom_bar()

Although the quality score runs from 0 to 10, the wines in this dataset have quality between 3 and 8, with a nearly normal distribution. The majority score 5 or 6.

summary(wine$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
dim(subset(wine,quality==5|quality ==6))/dim(wine)
## [1] 0.8248906 1.0000000

Wines scoring 5 or 6 are the majority and make up 82.5% of the wines in the dataset.
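
A more direct way to get this share is to take the mean of a logical vector. The sketch below uses a small made-up vector of scores (not the real data) to show the idea:

```r
# Hypothetical quality scores, for illustration only.
quality <- c(5, 6, 5, 7, 3, 6, 8, 5, 6, 4)

# %in% gives TRUE/FALSE per wine; the mean of a logical vector is the
# proportion of TRUE values.
share_5_or_6 <- mean(quality %in% c(5, 6))
print(share_5_or_6)   # 0.6 for this toy vector
```

On the real data this would be `mean(wine$quality %in% c(5, 6))`, which avoids the extra `1.0000000` that dividing two `dim()` results produces.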

#set the seed for reproducible results
set.seed(10433)
ggcorr(wine,
       method = c('all.obs','spearman'),
       nbreaks = 5,palette = 'PuOr',label = TRUE,
       name = 'spearman correlation coefficient',
       hjust = 0.9, angle = -70,size = 3) +
  ggtitle('Spearman Correlation Coefficient Matrix')
## Warning in ggcorr(wine, method = c("all.obs", "spearman"), nbreaks = 5, :
## data in column(s) 'sulfur.bucket' are not numeric and were ignored
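
The matrix above uses Spearman coefficients, while the `cor.test()` calls later in this report use the default Pearson method, so the two sets of numbers are not expected to match exactly. Spearman's coefficient is Pearson's coefficient computed on ranks, which a short sketch on made-up vectors can confirm:

```r
# Made-up data: monotonic but clearly non-linear.
x <- c(1.2, 2.5, 3.1, 4.8, 6.0, 7.7)
y <- c(0.5, 0.9, 2.0, 8.5, 9.1, 30.2)

spearman        <- cor(x, y, method = "spearman")
pearson_on_rank <- cor(rank(x), rank(y))   # Pearson on the ranks
pearson         <- cor(x, y)               # Pearson on the raw values

print(c(spearman = spearman,
        pearson_on_rank = pearson_on_rank,
        pearson = pearson))
```

Because `y` always increases with `x`, the Spearman value is 1 even though the raw-value Pearson coefficient is lower.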

Univariate Analysis

Main feature(s) of interest in the dataset

quality

Other features in the dataset that will help the investigation into the feature of interest

fixed.acidity, citric.acid, residual.sugar, alcohol

#create quality.bucket to categorize the quality
wine$quality.bucket[wine$quality < 9] <- 'Great'
wine$quality.bucket[wine$quality < 7] <- 'Good'
wine$quality.bucket[wine$quality < 4] <- 'Bad'

Categorize the quality into 3 types:

0-3 Bad / 4-6 Good / 7-10 Great
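
The three assignments above can also be written as one `cut()` call, which returns an ordered factor so the buckets carry an explicit order onto plot axes. This is an alternative sketch, not the code used above, and the toy vector is made up:

```r
# Toy quality scores covering the observed range 3 to 8.
quality <- c(3, 4, 5, 6, 7, 8)

# Intervals are (-Inf, 3], (3, 6], (6, Inf), matching the
# Bad / Good / Great buckets defined above.
quality.bucket <- cut(quality,
                      breaks = c(-Inf, 3, 6, Inf),
                      labels = c("Bad", "Good", "Great"),
                      ordered_result = TRUE)
print(table(quality.bucket))
```

Unlike the assignments above, this version would also label scores of 9 or 10 as 'Great' instead of leaving them as NA.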

##fixed.acidity vs. quality
ggplot(aes(x = fixed.acidity, y = quality), 
       data = wine) +
  geom_jitter(alpha = 0.5, color = "#2E70B2", fill = "#6FA9E2") +
  geom_smooth(method=lm, se=FALSE, color = 'black')

At first, I thought fixed.acidity would show some trend, because I had heard that specialists prefer an acidic taste to a sweet taste.

But according to the plot, there is no clear relationship.

## citric.acid vs. quality
ggplot(aes(x = citric.acid, y = quality), 
       data = wine) +
  geom_jitter(alpha = 0.5, color = "#2E70B2", fill = "#6FA9E2") +
  geom_smooth(method=lm, se=FALSE, color = 'black')

As we can see, there is no clear trend between citric.acid and quality.

Red wine is divided into dry, semi-dry, semi-sweet, and sweet according to residual.sugar, so we categorize residual.sugar to take a look.

##categorize the residual.sugar
wine$sugar_taste[wine$residual.sugar < 45] <- 'Semi-Sweet'
wine$sugar_taste[wine$residual.sugar < 12] <- 'Semi-Dry'
wine$sugar_taste[wine$residual.sugar < 4] <- 'Dry'

Categorize the residual.sugar into 3 types:

below 4 g/dm^3 Dry / 4-12 g/dm^3 Semi-Dry / 12-45 g/dm^3 Semi-Sweet

Because many points overlap on the discrete quality scores, jitter is used to show the point positions.

##residual.sugar vs. quality
ggplot(aes(x = residual.sugar, y = quality),
       data = wine) +
  geom_jitter(alpha = 0.5, color = "#2E70B2", fill = "#6FA9E2") +
  geom_smooth(method=lm, se=FALSE, color = 'black')

As the residual.sugar vs. quality chart shows, most wines are below 4 g/dm^3, yet these wines receive quality scores across the whole range from 3 to 8, so there is no clear relationship between residual.sugar and quality.

## alcohol vs. quality
ggplot(aes(x = alcohol, y = quality), 
       data = wine) +
  geom_jitter(alpha = 0.5, color = "#2E70B2", fill = "#6FA9E2") +
  geom_smooth(method=lm,color = 'black',se = FALSE)

One important factor for wine is the grapes: grape quality is the foundation of winemaking, and during production the sugar in the grapes is converted into alcohol. Alcohol therefore reflects the quality of the grapes in that year, which is why alcohol is chosen as the main factor for investigating quality.

##alcohol vs. quality.bucket 
ggplot(aes(x = quality.bucket, y = alcohol), 
       data = wine) + 
  geom_boxplot()

A boxplot shows the relationship between alcohol and quality.bucket. The trend is that ‘Great’ wines have higher alcohol.

## quality.bucket vs. fixed.acidity
ggplot(aes(x = quality.bucket, y = fixed.acidity), 
       data = wine) + 
  geom_boxplot()

A boxplot shows the relationship between fixed.acidity and quality.bucket. The trend is that ‘Great’ wines have higher fixed.acidity.

Other features with strong correlation

From the earlier correlation matrix, pick out the features that are unrelated to quality but strongly correlated with each other.

##fixed.acidity vs. density and color
ggplot(aes(x = fixed.acidity, y = density), 
       data = wine) +
  geom_jitter(alpha = 0.5,color = "#2E70B2", fill = "#6FA9E2") +
  geom_smooth(method=lm,se = FALSE, color = 'black')

A scatterplot of fixed.acidity against density shows a fairly strong positive correlation.

cor.test(wine$fixed.acidity,wine$density)
## 
##  Pearson's product-moment correlation
## 
## data:  wine$fixed.acidity and wine$density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473

Pearson’s product-moment correlation shows the same trend, with a coefficient of 0.6680473.
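
The t statistic in the output follows from the coefficient and the degrees of freedom via t = r * sqrt(df) / sqrt(1 - r^2). As a quick check, plugging in the reported values should land close to the reported t = 35.877:

```r
# Values copied from the cor.test() output above.
r  <- 0.6680473
df <- 1597        # n - 2, with n = 1599 wines

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
print(t_stat)
```

With 1599 observations, even a much weaker correlation would give a large t, so the coefficient itself is the more informative number here.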

##fixed.acidity vs. pH and color 
ggplot(aes(x = pH, y = fixed.acidity), 
       data = wine) +
  geom_jitter(alpha = 0.5 , color = "#2E70B2", fill = "#6FA9E2")  +
  geom_smooth(method=lm,se = FALSE, color ='black')

A scatterplot of fixed.acidity against pH shows a fairly strong negative correlation.

cor.test(wine$fixed.acidity,wine$pH)
## 
##  Pearson's product-moment correlation
## 
## data:  wine$fixed.acidity and wine$pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7082857 -0.6559174
## sample estimates:
##        cor 
## -0.6829782

Pearson’s product-moment correlation shows the same trend, with a coefficient of -0.6829782.

##free.sulfur.dioxide vs. total.sulfur.dioxide 
ggplot(aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide), 
       data = wine) +
  geom_jitter(alpha = 0.5, color = "#2E70B2", fill = "#6FA9E2") +
  geom_smooth(method=lm,se = FALSE, color ='black')

A scatterplot of free.sulfur.dioxide against total.sulfur.dioxide shows a fairly strong positive correlation.

cor.test(wine$free.sulfur.dioxide,wine$total.sulfur.dioxide)
## 
##  Pearson's product-moment correlation
## 
## data:  wine$free.sulfur.dioxide and wine$total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6395786 0.6939740
## sample estimates:
##       cor 
## 0.6676665

Pearson’s product-moment correlation shows the same trend, with a coefficient of 0.6676665.

##fixed.acidity vs. citric.acid 
ggplot(aes(x = fixed.acidity, y = citric.acid), 
       data = wine) +
  geom_jitter(alpha = 0.5, color = "#2E70B2", fill = "#6FA9E2") +
  geom_smooth(method=lm,se = FALSE, color ='black')

A scatterplot of fixed.acidity against citric.acid shows a fairly strong positive correlation.

cor.test(wine$citric.acid,wine$fixed.acidity)
## 
##  Pearson's product-moment correlation
## 
## data:  wine$citric.acid and wine$fixed.acidity
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034

Pearson’s product-moment correlation shows the same trend, with a coefficient of 0.6717034, the largest positive coefficient so far.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Contrary to what I expected, citric.acid, fixed.acidity, and residual.sugar do not have a strong correlation with quality, but alcohol shows a slight positive trend with quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • fixed.acidity and citric.acid: both of these features come from the grapes, and we can taste them in the wine.

  • free.sulfur.dioxide and total.sulfur.dioxide

  • fixed.acidity and density

  • fixed.acidity and pH

What was the strongest relationship you found?

The strongest relationship is fixed.acidity vs. pH, with a correlation of -0.6829782.
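
Comparing the four pairs by absolute value makes this ranking explicit. The sketch below only re-uses the Pearson coefficients printed earlier in this report; the pair labels are written out by hand:

```r
# Pearson coefficients copied from the cor.test() outputs in this report.
cors <- c(
  "fixed.acidity vs. density"                    =  0.6680473,
  "fixed.acidity vs. pH"                         = -0.6829782,
  "free.sulfur.dioxide vs. total.sulfur.dioxide" =  0.6676665,
  "fixed.acidity vs. citric.acid"                =  0.6717034
)

# Rank by absolute value so negative correlations are compared fairly.
ranked    <- cors[order(abs(cors), decreasing = TRUE)]
strongest <- names(ranked)[1]
print(ranked)
```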

Multivariate Plots Section

##fixed.acidity vs. pH and color by quality.bucket
ggplot(aes(x = pH, y = fixed.acidity, color = quality.bucket), 
       data = wine) +
  geom_jitter() +
  coord_cartesian() +
  scale_color_brewer(type = "seq")+
  geom_smooth(method=lm,se=FALSE)

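Looking at the overall direction of the points, the trend is inverse, just as in the Bivariate Analysis. After coloring by quality.bucket, no common trend emerges across the different categories.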

##fixed.acidity vs. density and color by quality.bucket
ggplot(aes(x = density, y = fixed.acidity, color = quality.bucket), 
       data = wine) +
  geom_jitter() +
  coord_cartesian() +
  scale_color_brewer(type = "seq") +
  scale_y_continuous(limits = c(0,16)) +
  geom_smooth(method=lm,se=FALSE)

The chart shows the correlation between density and fixed.acidity, with points colored by quality.bucket, and the three types form visible layers. The great-quality wines have slightly higher density than the bad ones.

##free.sulfur.dioxide vs. total.sulfur.dioxide and color by quality.bucket
ggplot(aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide, 
           color = quality.bucket), 
       data = wine) +
  geom_jitter(alpha = 0.5) +
  coord_cartesian() +
  scale_color_brewer(type = "seq") +
  geom_smooth(method=lm,se=FALSE)

The relationship between these two features was explored in the Bivariate Analysis section; coloring by quality.bucket reveals no clear trend.

##fixed.acidity vs. citric.acid and color by quality.bucket
ggplot(aes(x = fixed.acidity, y = citric.acid,
           color = quality.bucket), 
       data = wine) +
  geom_jitter() +
  coord_cartesian() +
  scale_color_brewer(type = "seq") +
  geom_smooth(method=lm,se=FALSE)

The relationship between these two features was explored in the Bivariate Analysis section; coloring by quality.bucket reveals no clear trend.

##alcohol vs. fixed.acidity and color by quality.bucket
ggplot(aes(x = alcohol, y = fixed.acidity, color = quality.bucket), 
       data = wine) +
  geom_jitter() +
  coord_cartesian() +
  scale_color_brewer(type = "seq") +
  geom_smooth(method=lm,se=FALSE)

The relationship between these two features was explored in the Bivariate Analysis section; coloring by quality.bucket reveals no clear trend.

##alcohol vs. residual.sugar and color by quality.bucket
ggplot(aes(x = alcohol, y = residual.sugar, color = quality.bucket), 
       data = wine) +
  geom_jitter() +
  coord_cartesian() +
  scale_color_brewer(type = "seq") +
  geom_smooth(method=lm,se=FALSE)

The relationship between these two features was explored in the Bivariate Analysis section; coloring by quality.bucket reveals no clear trend.

##alcohol vs. density and color by quality.bucket
ggplot(aes(x = alcohol, y = density, color = quality.bucket), 
       data = wine) +
  geom_jitter() +
  coord_cartesian() +
  scale_color_brewer(type = "seq") +
  geom_smooth(method=lm,se=FALSE)

The relationship between these two features was explored in the Bivariate Analysis section; coloring by quality.bucket reveals no clear trend.

Multivariate Analysis

When we color by quality.bucket, we can see the layers in these features and whether or not they strengthen each other.

The most interesting pair is fixed.acidity vs. density, where we already saw a strong correlation earlier.

Now we can see not only the clear trend but also which wines score better on quality.

##alcohol vs. quality.bucket and color by sugar_taste
ggplot(aes(x = quality.bucket, y = alcohol, color = sugar_taste), 
       data = wine) + 
  geom_boxplot()

For the Dry and Semi-Dry wines, alcohol rises slightly as the quality rating increases.

##fixed.acidity vs. quality.bucket and color by sugar_taste
ggplot(aes(x = quality.bucket, y = fixed.acidity, color = sugar_taste), 
       data = wine) + 
  geom_boxplot()

After coloring by sugar_taste, no clear trend appears.

##residual.sugar vs. quality.bucket and color by sugar_taste
ggplot(aes(x = quality.bucket, y = residual.sugar, color = sugar_taste), 
       data = wine) + 
  geom_boxplot()

Now put the sugar_taste category into the residual.sugar vs. quality.bucket chart. Only the ‘Good’ type contains ‘Semi-Sweet’ wines; for the other two sugar types, ‘Dry’ and ‘Semi-Dry’, we can see a positive trend between residual.sugar and quality.

Final Plots and Summary

Plot One

## Warning: `geom_bar()` no longer has a `binwidth` parameter. Please use
## `geom_histogram()` instead.

Description One

This chart shows the distribution of quality; 82.5% of the wines score 5 or 6.

Plot Two

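The chart is grouped by quality category, and we can see the alcohol trend across the three quality types. The great wines clearly have more alcohol, which suggests that the grapes of the high-scoring wines got more light and accumulated more sugar than those of the bad ones.

Grapes receive sunlight during the growing season and produce fructose in the fruit. During winemaking this fructose is converted into alcohol, and any fructose that is not fully converted stays in the wine as residual sugar (residual.sugar). So in general, with the same residual.sugar and all other conditions equal, we can infer the ripeness of the fruit (its fructose content) from the alcohol level. Fruit ripeness is a very important criterion for judging wine quality: the riper the fruit, the higher the fructose and the higher the alcohol; conversely, lower fructose gives lower alcohol. Northern Portugal is a relatively cold region where the grapes do not get much sunlight during the growing season, so according to the literature the alcohol level of its red wines is generally not very high; the earlier alcohol histogram shows a range of 8.4 to 14.9. In sunny New World regions such as Australia, however, the alcohol level would be noticeably above this range if the winemaker did not intervene. (Of course, nobody wants to drink such a wine.)

Fruit ripeness determines the difference between good vintages and poor vintages. A good vintage is a year in which the grapes ripened very well, and wines from those years generally score higher. A poor vintage is a year in which the grapes did not ripen well, most likely because of insufficient sunlight, pests and disease, or hail, rain, and snow, which leave the grapes low in sugar, so the resulting wine scores poorly.

Here, "sugar" does not mean residual sugar; it means the fructose in the grapes.

So I believe that for Vinho Verde, a wine from northern Portugal made with simple methods, the alcohol in the wine can indirectly reveal the ripeness of the grapes in a given year or region.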

## residual.sugar vs.quality.bucket and color by sugar_taste
ggplot(aes(x = quality.bucket, y = residual.sugar,
           color = sugar_taste), 
       data = wine) + 
  geom_boxplot() +
  ggtitle('Residual.Sugar vs.Quality.Bucket And Color By Sugar_taste') +
  xlab('Quality Category') +
  ylab('Residual.Sugar(g/dm^3)')

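The chart shows that within the ‘Dry’ and ‘Semi-Dry’ types, the median residual.sugar rises somewhat as the quality rating increases. That is, residual.sugar content is correlated with the quality rating.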
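
Medians read off a boxplot can be checked numerically with `aggregate()`. The data frame below is made up to keep the sketch self-contained; only the column names follow this report:

```r
# Hypothetical rows, for illustration only (not the wine data).
toy <- data.frame(
  quality.bucket = c("Bad", "Bad", "Good", "Good", "Good",
                     "Great", "Great", "Great"),
  sugar_taste    = c("Dry", "Dry", "Dry", "Dry", "Semi-Dry",
                     "Dry", "Dry", "Semi-Dry"),
  residual.sugar = c(1.8, 2.0, 2.0, 2.4, 5.0, 2.2, 2.8, 6.0)
)

# Median residual.sugar for every quality bucket within each sugar type.
medians <- aggregate(residual.sugar ~ quality.bucket + sugar_taste,
                     data = toy, FUN = median)
print(medians)
```

On the real data the same call with `data = wine` would give the medians behind each box in the plot.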

Plot Three

Description Three

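density and fixed.acidity are positively and linearly correlated: as fixed.acidity rises, density tends to rise as well. After adding the quality grouping, we can see that high-quality wines become more common as fixed.acidity rises, while density has little effect on quality (because quality improves along the vertical direction).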

Reflection

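In this dataset, it is hard to tell from any single feature what a higher-scoring wine looks like; we can only make out some slight trends.

In fact, the quality of a wine, or more precisely in this dataset the scores given by 3 experts, is not determined entirely by its chemical composition.

One of the features is residual sugar, but because winemaking techniques differ, two bottles that test with the same residual sugar can receive very different evaluations. In some places, such as Canada, the harvest is delayed so that sugar accumulates; in some places the process involves adding sugar a second time; and in others fermentation is deliberately stopped so that the sugar is not fully converted into alcohol. This means the market offers products with very similar alcohol, pH, and residual sugar readings whose flavor and mouthfeel are worlds apart. Especially across different grape varieties, using technique to give a wine more distinctive yet harmonious ‘flavors’ is an important criterion in judging a bottle. (A wine is judged not only on the result but also on the craft the winemaker used to solve problems.)

For example, two wines may both be 13% alcohol, both medium-high in acidity on the palate, and close in pH, yet one shows aromas of elderflower, cranberry, hawthorn cake, and petrol, while the other shows violet, wild mulberry, wild blueberry, and stable notes. In tasting, wine quality is judged by color, aroma, and palate, and aroma and flavor in turn differ by grape variety, by growing environment, and by the aromas imparted by the winemaking technique and process. In this dataset we do not know the grape variety, nor are we told the specific vintage or winemaking technique, and the features in the dataset cannot determine them.

When tasting a wine, four main factors matter: body, flavor intensity, astringency, and acidity. These factors are all sensory, however, and it is hard to tie them to chemical components, for example to say which components make up body. Knowing how much weight these components carry when experts evaluate a wine might make this analysis clearer. That would require close collaboration between data analysts and wine-tasting experts to link the sensory experience with the quantitative measures.

In addition, whether a wine is good is to a large extent very subjective. People new to wine or fond of desserts tend to prefer sweet wines, such as late harvest, low-alcohol lightly sparkling sweet whites, or noble rot wines. Even though these wines are easier to drink and widely liked, experts do not take them seriously; they regard appreciating high acidity as the mark of professionalism. So sweet and semi-sweet wines have long been accepted by the market without earning professional praise. From this point of view, perhaps tasting criteria should be quantified in a more fine-grained way with clear standards, because so far this remains a highly subjective field.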

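Frustrations:

I do not know how to express the sensory experience of wine tasting in terms of quantified chemical components.

Also, while working on this, I did not know how the usual indicators of wine quality are weighted in this dataset, because this wine cannot be aged and is meant to be drunk fresh, and no single feature shows a clear trend with quality. I therefore wondered whether a model exists, but I do not yet know how to build one.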

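What worked:

In the second chart of Plot Two, Residual Sugar vs. Quality.Bucket, higher-rated wines show slightly higher residual sugar. I think this may be because these wines come from northern Portugal, where the factor that most harms wine quality is that the grapes may not get enough sunlight during the growing season and end up sharp and highly acidic. A moderate amount of residual sugar can balance the acidity and make the wine smoother, which leads to higher scores.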

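One point deserves special mention:

As I said earlier, wine tasting is highly subjective. The red wine quality in this dataset, or more precisely the scores given by 3 experts, rises and falls with the experts' taste preferences. So strictly speaking, this dataset does not measure how good the red wines are, and the task is not really to find the relationship between quality scores and chemical composition. More precisely, this dataset captures the taste preferences of the three experts who did the scoring.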
